
Simplify checkpointer and make it work for large models #628

Merged: 17 commits, Feb 20, 2020

Conversation

ali-ramadhan
Member

@ali-ramadhan ali-ramadhan commented Feb 16, 2020

This PR finally upgrades the checkpointer so it can restore large models that take up more than 50% of system memory. It used to create a model and then restore the fields, which allocated twice as much memory as needed. Now the data needed to restore the fields is passed to the model constructor, so there is no double allocation. Some refactoring had to happen to make this possible.
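A minimal sketch of the memory problem, using plain arrays rather than the actual Oceananigans model and checkpointer (the variable names here are hypothetical):

```julia
# Old approach: the model allocates its field first, then the checkpoint
# loader allocates a second copy of the data and copies it over, so both
# arrays are resident at once (~2x peak memory per field).
field = zeros(Float64, 128, 128, 128)           # allocated by the model
checkpoint_data = rand(Float64, 128, 128, 128)  # allocated by the loader
copyto!(field, checkpoint_data)                 # peak memory: two fields

# New approach: load the checkpoint data first and let the model adopt the
# array directly, so only one copy of each field is ever allocated.
checkpoint_data = rand(Float64, 128, 128, 128)
field = checkpoint_data                          # no copy, peak memory: one field
```

With the new ordering, a model whose fields fill more than half of system memory can still be restored, because the restored data is never duplicated.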

This PR is also part 2/3 of making boundary conditions a field property.

Should help a lot with #602 and #603.

Resolves #416
Resolves #417

Note: This PR branches off #627.

@ali-ramadhan ali-ramadhan added the performance 🏍️ (So we can get the wrong answer even faster), cleanup 🧹 (Paying off technical debt), and GPU 👾 (Where Oceananigans gets its powers from) labels on Feb 16, 2020
@glwagner
Member

@ali-ramadhan I will wait until part 1/3 is merged to review this.

@ali-ramadhan
Member Author

@glwagner Just merged part 1 (#626) so this PR is now a lot smaller and should be easier to review.

@codecov

codecov bot commented Feb 19, 2020

Codecov Report

Merging #628 into master will decrease coverage by 4.67%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #628      +/-   ##
==========================================
- Coverage   75.13%   70.45%   -4.68%     
==========================================
  Files         118      118              
  Lines        2280     2288       +8     
==========================================
- Hits         1713     1612     -101     
- Misses        567      676     +109     
Impacted Files Coverage Δ
src/Solvers/solver_utils.jl 0.00% <0.00%> (-100.00%) ⬇️
src/Solvers/channel_pressure_solver.jl 30.13% <0.00%> (-68.50%) ⬇️
...c/Solvers/horizontally_periodic_pressure_solver.jl 41.50% <0.00%> (-56.61%) ⬇️
src/Solvers/plan_transforms.jl 66.66% <0.00%> (-33.34%) ⬇️
src/Fields/set!.jl 36.53% <0.00%> (-25.23%) ⬇️
src/Architectures.jl 66.66% <0.00%> (-22.23%) ⬇️
src/Fields/field.jl 69.69% <0.00%> (-9.10%) ⬇️
...nditions/solution_and_model_boundary_conditions.jl 89.18% <0.00%> (-5.41%) ⬇️
src/Fields/field_tuples.jl 78.94% <0.00%> (-5.27%) ⬇️
src/Buoyancy/Buoyancy.jl 57.89% <0.00%> (-5.27%) ⬇️
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4ef4d11...48e5424

Member

@glwagner glwagner left a comment


A couple questions / comments:

This PR finally upgrades the checkpointer so it can restore large models that take up more than 50% of system memory. It used to create a model then restore the fields which allocates twice as much memory as needed.

This is only relevant for CPU models --- correct? In other words, when running on the GPU we still need to load data onto the CPU and then transfer to the GPU. Previously, the option was available to create a model on the GPU, load data on to the CPU, and then copy that data to the GPU --- right? Or am I missing something?

edit: for example:

julia> using CuArrays

julia> a = CuArray{Float64}(undef, 10, 10); b = rand(10, 10);

julia> copyto!(a, b)  # copies the CPU array b into the GPU array a

Does this PR impact the user API for checkpointing at all, or does it just change restore_from_checkpoint, which does everything behind the scenes?

We may want to move the section on checkpointing that is currently in the documentation at Model setup > Output writers to its own section within the documentation.

@ali-ramadhan
Member Author

This is only relevant for CPU models --- correct? In other words, when running on the GPU we still need to load data onto the CPU and then transfer to the GPU.

That is technically true, although in general you tend to have a lot more CPU memory than GPU memory, so I suspect it won't be an issue there. You could reduce memory allocation even further by loading the fields from disk into one temporary array, copying each field to a CuArray one at a time. But I don't think that will work with JLD2, as the entire file is loaded into memory at once. You could do it with chunked NetCDF files, for example, where you read specific chunks into memory one at a time.
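A hedged sketch of that chunked idea, assuming NCDatasets.jl and CuArrays.jl; the variable names and the `restore_fields_chunked!` helper are hypothetical, not part of the Oceananigans API:

```julia
using NCDatasets, CuArrays

# Read one field at a time from a NetCDF checkpoint and copy it into
# preexisting GPU arrays, so the CPU only ever holds one field's worth
# of checkpoint data at once.
function restore_fields_chunked!(gpu_fields, filepath)
    NCDataset(filepath) do ds
        for (name, gpu_array) in gpu_fields
            chunk = Array(ds[name])    # only this field is resident on the CPU
            copyto!(gpu_array, chunk)  # host-to-device copy
        end
    end
    return nothing
end

gpu_fields = Dict("u" => CuArray{Float64}(undef, 64, 64, 64),
                  "v" => CuArray{Float64}(undef, 64, 64, 64))
# restore_fields_chunked!(gpu_fields, "checkpoint.nc")
```

This keeps peak CPU memory at one field rather than the whole checkpoint, which JLD2 (loading the whole file at once) would not allow.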

Previously, the option was available to create a model on the GPU, load data on to the CPU, and then copy that data to the GPU --- right? Or am I missing something?

You could do that but no user would have gone through the trouble.

Does this PR impact the user API for checkpointing at all, or does it just change restore_from_checkpoint, which does everything behind the scenes?

Just changes what happens behind the scenes. The API is the same, but checkpoint files now have names like convection_iteration12500.jld2 instead of just convection_12500.jld2, although maybe we should use the word checkpoint in the file names.

We may want to move the section on checkpointing that is currently in the documentation at Model setup > Output writers to its own section within the documentation.

Sounds like a good idea, will do before merging.

@ali-ramadhan ali-ramadhan merged commit 4ed3660 into master Feb 20, 2020
@ali-ramadhan ali-ramadhan deleted the ar/checkpointer-upgrade branch February 20, 2020 14:07
@glwagner
Member

You could do that but no user would have gone through the trouble.

Well, that's fair; I was more commenting that "we" could have done that behind the scenes for the GPU case.

That is technically true although in general you tend to have a lot more CPU memory than GPU memory so I suspect it won't be an issue there.

I'm still confused about how this changes what was previously done. Did we previously create arrays on the GPU, copy the checkpoint from the CPU to "temporary" GPU arrays, and then copy from the temporary GPU arrays to the previously-instantiated model data? Or am I missing something? (Is it possible to load data from disk directly to the GPU?)

@ali-ramadhan
Member Author

I'm still confused about how this changes what was previously done. Did we previously create arrays on the GPU, copy checkpoint from the CPU to "temporary" GPU arrays, and then copy from temporary GPU arrays to previously-instantiated model data?

Yeah, this is what we were previously doing in restore_fields!:
https://github.com/climate-machine/Oceananigans.jl/blob/8352e56f5839b23d3441f6f8bd0f297f3e0b508f/src/OutputWriters/checkpointer.jl#L102

which is rather wasteful, because you create a temporary CuArray and then copy elements over (we had already allocated memory for the fields in the model). We could have done it with copyto!(::CuArray, ::Array), which would avoid unnecessary allocations except for the one temporary array.

Now we construct the fields with the restored data and pass them to the model constructor, so there are no temporary arrays and zero unnecessary allocations.
https://github.com/climate-machine/Oceananigans.jl/blob/4ed366019c3f6a9b3ba9cf19691fef721204ea3c/src/OutputWriters/checkpointer.jl#L110-L114
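To spell out the two restore strategies being contrasted, here is a hedged sketch using CuArrays; the names are hypothetical stand-ins, not the actual Oceananigans code:

```julia
using CuArrays

# Old strategy: the model's GPU array already exists; the checkpoint data
# is wrapped in a temporary CuArray, then copied element-by-element into
# the pre-allocated model data (one extra GPU allocation per field).
model_u = CuArray{Float64}(undef, 16, 16, 16)
checkpoint_u = rand(16, 16, 16)
tmp = CuArray(checkpoint_u)  # temporary GPU allocation
copyto!(model_u, tmp)

# New strategy: build the GPU array from the checkpoint data once and hand
# it to the model constructor, so no temporary GPU array is ever created.
model_u = CuArray(checkpoint_u)
```

The new strategy allocates each field exactly once on the device, which is what lets the constructor-based restore avoid the doubling described in the PR description.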

is it possible to load data from disk directly to the GPU?

Hmmm, not sure, but it doesn't seem impossible. At the lowest level it will have to do some host-to-device copies, though, I think. @leios or @vchuravy would know.
